Conversation

EgorBo (Member) commented on Aug 24, 2025

If a non-constant stackalloc (without SkipLocalsInit) is mostly used with the same size, we can convert it from:

var _ = stackalloc T[N]

to

var _ = N == ProfiledConst ? stackalloc T[ProfiledConst] : stackalloc T[N];

so the common case can benefit from faster, constant-size zeroing.
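
As a rough source-level illustration of the guarded expansion (the real transformation happens on the JIT's IR; the constant 64 and the method names below are made up and stand in for whatever size value profiling reports as dominant):

using System;

static class StackallocPgoSketch
{
    // Original shape: the size is only known at run time, so the zeroing
    // required by localsinit has to be done generically.
    static int Original(int n)
    {
        Span<byte> buffer = stackalloc byte[n];
        return Use(buffer);
    }

    // Guarded expansion: the profiled size takes a constant-size stackalloc,
    // which the JIT can zero with cheap straight-line code; all other sizes
    // fall back to the original path.
    static int Specialized(int n)
    {
        Span<byte> buffer = n == 64
            ? stackalloc byte[64]
            : stackalloc byte[n];
        return Use(buffer);
    }

    static int Use(Span<byte> span) => span.Length;
}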

Benchmark

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);

public class Benchmarks
{
    [Benchmark]
    [Arguments(2)]
    [Arguments(16)]
    [Arguments(64)]
    [Arguments(100)]
    [Arguments(300)]
    [Arguments(512)]
    [Arguments(2048)]
    [Arguments(16*1024)]
    public void NonConstantStackalloc(int n) => Consume(stackalloc byte[n]);

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Consume(Span<byte> span){}
}

Benchmark results on linux_azure_cascadelake

Method                 Toolchain      n        Mean       Error  Ratio
NonConstantStackalloc  Main           2    1.726 ns   0.0004 ns   1.00
NonConstantStackalloc  PR             2    1.725 ns   0.0005 ns   1.00
NonConstantStackalloc  Main          16    1.726 ns   0.0004 ns   1.00
NonConstantStackalloc  PR            16    1.726 ns   0.0006 ns   1.00
NonConstantStackalloc  Main          64    3.164 ns   0.0007 ns   1.00
NonConstantStackalloc  PR            64    1.839 ns   0.0005 ns   0.58
NonConstantStackalloc  Main         100    5.994 ns   0.0337 ns   1.00
NonConstantStackalloc  PR           100    1.840 ns   0.0002 ns   0.31
NonConstantStackalloc  Main         300    7.746 ns   0.0188 ns   1.00
NonConstantStackalloc  PR           300    1.939 ns   0.0255 ns   0.25
NonConstantStackalloc  Main         512   11.353 ns   0.0384 ns   1.00
NonConstantStackalloc  PR           512    4.177 ns   0.0008 ns   0.37
NonConstantStackalloc  Main        2048   39.302 ns   0.0123 ns   1.00
NonConstantStackalloc  PR          2048   19.214 ns   0.0035 ns   0.49
NonConstantStackalloc  Main       16384  303.593 ns   0.0450 ns   1.00
NonConstantStackalloc  PR         16384   85.821 ns   0.0411 ns   0.28

The github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Aug 24, 2025

EgorBo closed this Aug 24, 2025
EgorBo reopened this Aug 24, 2025
EgorBo (Member, Author) commented on Aug 24, 2025

@EgorBot -amd -intel -arm

(same benchmark source as in the PR description above)

ymalich commented on Aug 24, 2025

thanks!

EgorBo marked this pull request as ready for review September 3, 2025 11:04
Copilot AI review requested due to automatic review settings September 3, 2025 11:04
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds Profile-Guided Optimization (PGO) instrumentation for stackalloc operations to improve performance by enabling constant-size optimization for frequently used stack allocation sizes.

Key changes:

  • Introduces profiling and optimization for non-constant stackalloc operations
  • Refactors profile value picking into a reusable utility function
  • Creates specialized tree node type for stackalloc to store IL offset information

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:

File                               Description
src/coreclr/jit/importercalls.cpp  Refactors profile value picking logic into reusable pickProfiledValue function
src/coreclr/jit/importer.cpp       Adds PGO instrumentation and optimization for stackalloc operations
src/coreclr/jit/gtstructs.h        Maps new GenTreeOpWithILOffset struct to GT_LCLHEAP node type
src/coreclr/jit/gtlist.h           Updates GT_LCLHEAP to use new specialized node type with IL offset
src/coreclr/jit/gentree.h          Defines new GenTreeOpWithILOffset struct for storing IL offset
src/coreclr/jit/gentree.cpp        Adds support for new node type in comparison, hashing, and cloning operations
src/coreclr/jit/fgprofile.cpp      Extends value profiling infrastructure to handle stackalloc operations
src/coreclr/jit/compiler.h         Adds declarations for new utility functions
src/coreclr/jit/block.h            Adds schema index field for value instrumentation

int bbHistogramSchemaIndex; // schema index for histogram instrumentation
};

int bbValueSchemaIndex; // schema index for value instrumentation
EgorBo (Member, Author) replied on Sep 3, 2025

Count and HandleHistogram probes each have their own index field; Value probing used to reuse the Handle one, which could lead to asserts. This field doesn't grow BasicBlock's layout (it had padding): still the same 272 bytes on Release 64-bit.

AndyAyersMS (Member) asked:

How does this work if there are multiple value probes in a block?

op1 = gtNewOperNode(GT_LCLHEAP, TYP_I_IMPL, op2);
// May throw a stack overflow exception. Obviously, we don't want locallocs to be CSE'd.
op1->gtFlags |= (GTF_EXCEPT | GTF_DONT_CSE);
op1 = gtNewLclHeapNode(op2, opcodeOffs);
EgorBo (Member, Author) replied:

I'll move the entire CEE_LOCALLOC importation to a separate function in a separate PR in order to simplify code-review

EgorBo (Member, Author) commented on Sep 3, 2025

PTAL @AndyAyersMS @dotnet/jit-contrib

This significantly speeds up zeroing for non-constant stackalloc with the help of Value Profiling. I had to introduce struct GenTreeOpWithILOffset : public GenTreeOp in order to keep the IL_OFFSET for GT_LCLHEAP -- we discussed this option on Discord recently.
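
For context on why zeroing is the cost being attacked here, a small illustration (not code from the PR; the method names are made up): the win only exists while localsinit is in effect, i.e. when SkipLocalsInit is not applied, exactly as the PR description notes.

using System;
using System.Runtime.CompilerServices;

static class LocalsInitSketch
{
    // localsinit is emitted by default, so the runtime must zero all n bytes
    // before handing out the Span; that zeroing is what the profiled-constant
    // fast path makes cheap.
    static byte ZeroInitialized(int n)
    {
        Span<byte> buffer = stackalloc byte[n];
        return buffer[0]; // always 0 thanks to localsinit
    }

    // With SkipLocalsInit (requires AllowUnsafeBlocks in the project) the
    // memory is handed out uninitialized, so there is no zeroing to speed up.
    [SkipLocalsInit]
    static byte Uninitialized(int n)
    {
        Span<byte> buffer = stackalloc byte[n];
        buffer.Clear(); // initialize explicitly if the contents matter
        return buffer[0];
    }
}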

EgorBo requested a review from AndyAyersMS September 3, 2025 11:42

if (lengthNode->TypeGet() != TYP_I_IMPL)
{
    lengthNode = compiler->gtNewCastNode(TYP_I_IMPL, lengthNode, /* isUnsigned */ false, TYP_I_IMPL);
EgorBo (Member, Author) commented:

Previously, all memset/memcpy primitives used a TYP_I_IMPL length, while GT_LCLHEAP uses TYP_INT.

AndyAyersMS (Member) replied:

The ECMA spec is weird here:

III.3.47 
...
The localloc instruction allocates size (type native unsigned int or U4) bytes from the local

At any rate, it seems like representing it as TYP_I_IMPL would be OK.

AndyAyersMS (Member) left a comment

Can you try having two variable-sized stackallocs back to back and verify we get the right profile for each?
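
A minimal sketch of such a test (names, sizes, and iteration counts are illustrative, not taken from this PR): two independent variable-sized stackallocs in one method, each trained with its own dominant size, so each localloc should end up with its own profiled constant once the method is rejitted with PGO data.

using System;
using System.Runtime.CompilerServices;

static class TwoStackallocsTest
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int TwoAllocations(int first, int second)
    {
        Span<byte> a = stackalloc byte[first];   // trained mostly with 64
        Span<byte> b = stackalloc byte[second];  // trained mostly with 256
        return a.Length + b.Length;
    }

    static void Main()
    {
        // Feed each allocation its own dominant size, plus a few outliers,
        // and keep calling so tiered compilation promotes the method with
        // the collected profile applied.
        for (int i = 0; i < 1_000_000; i++)
        {
            TwoAllocations(64, 256);
        }
        Console.WriteLine(TwoAllocations(10, 20));
    }
}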

}
else
{
    // NOTE: we don't want to convert the fastpath stackalloc to a local like we
A reviewer (Member) commented:

If this block is executed frequently enough, maybe we should convert it to a local? You can compare the block's weight to that of the method entry; if the ratio is close to 1, consider the conversion.
